Enterprise Database Systems
Getting Started with Hadoop
Getting Started with Hadoop: Advanced Operations Using MapReduce
Getting Started with Hadoop: Developing a Basic MapReduce Application
Getting Started with Hadoop: Filtering Data Using MapReduce
Getting Started with Hadoop: Fundamentals & MapReduce
Getting Started with Hadoop: MapReduce Applications With Combiners

Getting Started with Hadoop: Advanced Operations Using MapReduce

Course Number:
it_dshpfddj_05_enus
Lesson Objectives

Getting Started with Hadoop: Advanced Operations Using MapReduce

  • Course Overview
  • define a vehicle type that can be used to represent automobiles to be stored in a Java PriorityQueue
  • configure a Mapper to use a PriorityQueue to store the five most expensive vehicles it has processed from the dataset
  • use a PriorityQueue in the Reducer of the application to receive the five most expensive automobiles from each Mapper and write the top five vehicles overall to the output
  • execute the application and examine the output on HDFS to confirm that the five most expensive automobiles have been written out
  • define the Mapper for a MapReduce application to build an inverted index from a set of text files
  • configure the Reducer and the Driver for the inverted index application
  • run the application and examine the inverted index on HDFS
  • recognize the data structures and configurations involved when extracting the top N values from a data set

Overview/Description

In this Skillsoft Aspire course, explore how MapReduce can be used to extract the five most expensive vehicles in a dataset, then build an inverted index for the words appearing in a set of text files. Begin by defining a vehicle type that can be used to represent automobiles stored in a Java PriorityQueue, then configure a Mapper to use a PriorityQueue to hold the five most expensive automobiles it has processed from the dataset. Learn how to use a PriorityQueue in the Reducer of the application to receive the five most expensive automobiles from each Mapper and write the top five automobiles overall to the output, then execute the application to verify the results. Next, explore how to use the MapReduce framework to generate an inverted index, and configure the Reducer and Driver for the inverted index application. Then run the application and examine the inverted index on HDFS (Hadoop Distributed File System). The concluding exercise involves advanced operations using MapReduce.
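The top-N pattern this course applies inside the Mapper and Reducer can be sketched in plain Java, with no Hadoop dependency: a bounded min-heap retains only the five most expensive vehicles seen so far. The `Vehicle` record and the sample prices below are illustrative assumptions, not the course's own dataset or Writable types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNVehicles {

    // Hypothetical vehicle record; the course defines its own type for HDFS data.
    record Vehicle(String make, double price) {}

    static List<Vehicle> topN(List<Vehicle> input, int n) {
        // Min-heap ordered by price: the cheapest retained vehicle sits at the
        // head and is evicted whenever a pricier one arrives.
        PriorityQueue<Vehicle> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Vehicle::price));
        for (Vehicle v : input) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // drop the cheapest, keeping the heap at size n
            }
        }
        List<Vehicle> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(Vehicle::price).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Vehicle> vehicles = List.of(
            new Vehicle("audi", 23875), new Vehicle("bmw", 41315),
            new Vehicle("toyota", 7898), new Vehicle("porsche", 37028),
            new Vehicle("jaguar", 32250), new Vehicle("honda", 7295),
            new Vehicle("mercedes", 45400));
        // Prints the five most expensive vehicles, most expensive first.
        for (Vehicle v : topN(vehicles, 5)) {
            System.out.println(v.make() + "\t" + v.price());
        }
    }
}
```

In the course's application, each Mapper keeps such a heap for its own input split and emits its five survivors, and the single Reducer runs the same logic over all the Mappers' candidates to produce the overall top five.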



Target

Prerequisites: none

Getting Started with Hadoop: Developing a Basic MapReduce Application

Course Number:
it_dshpfddj_02_enus
Lesson Objectives

Getting Started with Hadoop: Developing a Basic MapReduce Application

  • Course Overview
  • create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
  • work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop
  • use Maven to create a new Java project for the MapReduce application
  • develop a Mapper for the word frequency application that includes the logic to parse one line of the input file and produce a collection of keys and values as output
  • create a Reducer for the application that will collect the Mapper output and calculate the word frequencies in the input text file
  • specify the configurations of the MapReduce applications in the Driver program and the project's pom.xml file
  • build the MapReduce word frequency application using Maven to produce a jar file and then prepare for execution from the master node of the Hadoop cluster
  • run the application and examine the outputs generated to get the word frequencies in the input text document
  • identify the apps packaged with Hadoop and the purposes they serve, and recall the classes/methods used in the Map and Reduce phases of a MapReduce application

Overview/Description

In this Skillsoft Aspire course, discover how to use Hadoop's MapReduce: provision a Hadoop cluster on the cloud and build an application with MapReduce to calculate word frequencies in a text document. To start, create a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service; then work with the YARN Cluster Manager and HDFS (Hadoop Distributed File System) NameNode web applications that come packaged with Hadoop. Use Maven to create a new Java project for the MapReduce application, and develop a Mapper for the word frequency application. Create a Reducer for the application that will collect the Mapper output and calculate the word frequencies in the input text file, and specify the configurations of the MapReduce application in the Driver program and the project's pom.xml file. Next, build the MapReduce word frequency application with Maven to produce a jar file and prepare for execution from the master node of the Hadoop cluster. Finally, run the application and examine the outputs generated to get the word frequencies in the input text document. The concluding exercise involves developing a basic MapReduce application.
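The Mapper, Reducer, and Driver roles the course builds can be sketched without the Hadoop API as follows. The class and method names below mirror the roles rather than Hadoop's actual `org.apache.hadoop.mapreduce` classes, and the grouping step stands in for the framework's shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequencyApp {

    static class WordMapper {
        // Parse one line of the input file into (word, 1) pairs.
        List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.toLowerCase().split("\\W+"))
                if (!word.isEmpty()) out.add(Map.entry(word, 1));
            return out;
        }
    }

    static class SumReducer {
        // Sum all counts collected for a single word.
        int reduce(List<Integer> counts) {
            return counts.stream().mapToInt(Integer::intValue).sum();
        }
    }

    // The "Driver": run the map step, group pairs by key (the shuffle), reduce.
    static Map<String, Integer> run(List<String> lines) {
        WordMapper mapper = new WordMapper();
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (var pair : mapper.map(line))
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
        SumReducer reducer = new SumReducer();
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reducer.reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick brown fox", "the lazy dog")));
    }
}
```

In the real application, the Driver instead configures a `Job` with the Mapper and Reducer classes and their key/value types, and Hadoop handles the grouping and distribution across the cluster.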



Target

Prerequisites: none

Getting Started with Hadoop: Filtering Data Using MapReduce

Course Number:
it_dshpfddj_03_enus
Lesson Objectives

Getting Started with Hadoop: Filtering Data Using MapReduce

  • Course Overview
  • create a new project and code up the Mapper for an application to count the number of passengers in each class of the Titanic in the input dataset
  • develop a Reducer and Driver for the application to generate the final passenger counts in each class of the Titanic
  • build the project using Maven and run it on the Hadoop master node to check that the output correctly shows the numbers in each passenger class
  • apply MapReduce to filter through only the surviving passengers on the Titanic from the input dataset
  • execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
  • use MapReduce to obtain a distinct set of the cuisines offered by the restaurants in a dataset
  • build and run the application and confirm the output using HDFS from both the command line and the web application
  • identify the configuration functions used to customize a MapReduce application and recognize the types of input and output when null values are transmitted from the Mapper to the Reducer

Overview/Description

Extracting meaningful information from a very large dataset can be painstaking. In this Skillsoft Aspire course, learners examine how Hadoop's MapReduce can be used to speed up this operation. In a new project, code the Mapper for an application to count the number of passengers in each class of the Titanic in the input dataset. Then develop a Reducer and Driver to generate the final passenger counts in each class. Build the project using Maven and run it on the Hadoop master node to check that the output correctly shows the numbers in each passenger class. Apply MapReduce to filter through only the surviving passengers on the Titanic from the input dataset. Execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS (Hadoop Distributed File System) NameNode web user interfaces. Using a restaurant dataset, use MapReduce to obtain a distinct set of the cuisines offered. Build and run the application and confirm the output using HDFS from both the command line and the web application. The concluding exercise involves filtering data using MapReduce.
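The distinct-values technique from the cuisines objective can be sketched in plain Java: the Mapper emits each cuisine as a key with a null (ignored) value, and because the framework groups by key, the Reducer sees each distinct cuisine exactly once and simply writes the key out. The `Restaurant` record and sample rows are hypothetical; here a sorted set stands in for the framework's grouping.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class DistinctCuisines {

    // Hypothetical restaurant record; the course parses rows from a dataset file.
    record Restaurant(String name, String cuisine) {}

    static Set<String> distinctCuisines(List<Restaurant> rows) {
        return rows.stream()
            .map(Restaurant::cuisine)                        // emit key only; value unused
            .collect(Collectors.toCollection(TreeSet::new)); // grouping by key = dedup
    }

    public static void main(String[] args) {
        List<Restaurant> rows = List.of(
            new Restaurant("Roma", "Italian"),
            new Restaurant("Lotus", "Chinese"),
            new Restaurant("Napoli", "Italian"));
        System.out.println(distinctCuisines(rows)); // prints [Chinese, Italian]
    }
}
```

Transmitting a null value with each key is what the final objective's "null values from the Mapper to the Reducer" refers to: no payload is needed when the keys themselves carry all the information.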



Target

Prerequisites: none

Getting Started with Hadoop: Fundamentals & MapReduce

Course Number:
it_dshpfddj_01_enus
Lesson Objectives

Getting Started with Hadoop: Fundamentals & MapReduce

  • Course Overview
  • describe what big data is and list the various sources and characteristics of data available today
  • recognize the challenges involved in processing big data and the options available to address them such as vertical and horizontal scaling
  • specify the role of Hadoop in processing big data and describe the function of its components such as HDFS, MapReduce, and YARN
  • identify the purpose and describe the workings of Hadoop's MapReduce framework to process data in parallel on a cluster of machines
  • recall the steps involved in building a MapReduce application and the specific workings of the Map phase in processing each row of data in the input file
  • recognize the functions of the Shuffle and Reduce phases in sorting and interpreting the output of the Map phase to produce a meaningful output
  • recognize the techniques related to scaling data processing tasks, working with clusters, and MapReduce and identify the Hadoop components and their functions

Overview/Description

In this course, learners will explore the theory behind big data analysis using Hadoop, and how MapReduce enables parallel processing of large data sets distributed on a cluster of machines. Begin with an introduction to big data and the various sources and characteristics of data available today. Look at challenges involved in processing big data and options available to address them. Next, a brief overview of Hadoop, its role in processing big data, and the functions of its components such as the Hadoop Distributed File System (HDFS), MapReduce, and YARN (Yet Another Resource Negotiator). Explore the working of Hadoop's MapReduce framework to process data in parallel on a cluster of machines. Recall steps involved in building a MapReduce application and specifics of the Map phase in processing each row of the input file's data. Recognize the functions of the Shuffle and Reduce phases in sorting and interpreting the output of the Map phase to produce a meaningful output. To conclude, complete an exercise on the fundamentals of Hadoop and MapReduce.
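The Map, Shuffle, and Reduce phases described above can be made concrete with a short plain-Java walkthrough that exposes the intermediate data at each phase. This is a conceptual sketch, not Hadoop code: each phase is a method, and the two sample lines stand in for rows of an input file split across mappers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapShuffleReduce {

    // Map phase: each input line independently yields (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle phase: pairs are sorted and grouped by key before reduction.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (var e : pairs)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return grouped;
    }

    // Reduce phase: each key's list of values is aggregated into one result.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        var mapped = mapPhase(List.of("hadoop stores data", "hadoop processes data"));
        var shuffled = shufflePhase(mapped);
        System.out.println("mapped:   " + mapped);
        System.out.println("shuffled: " + shuffled);
        System.out.println("reduced:  " + reducePhase(shuffled));
    }
}
```

Because the map step needs only its own line and the reduce step needs only one key's values, both phases parallelize naturally across a cluster of machines, which is the horizontal-scaling idea the course develops.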



Target

Prerequisites: none

Getting Started with Hadoop: MapReduce Applications With Combiners

Course Number:
it_dshpfddj_04_enus
Lesson Objectives

Getting Started with Hadoop: MapReduce Applications With Combiners

  • Course Overview
  • recognize the need for combiners to optimize the execution of a MapReduce application by minimizing data transfers within a cluster
  • recall the steps involved in processing data in a MapReduce application
  • describe the working of a Combiner in performing a partial reduction of the data that is output from the Mapper
  • configure a Combiner to optimize a MapReduce application that calculates an average value
  • use Maven to create a new project for a MapReduce application and plan out the Map and Reduce phases by examining the auto prices dataset
  • develop the Mapper and Reducer for the application that will calculate the average price for each make of automobile in the input dataset
  • create the driver program for the MapReduce application
  • run the MapReduce application and check the output to get the average price for each automobile make
  • code up a Combiner for the MapReduce application and configure the Driver to use it for a partial reduction on the Mapper nodes of the cluster
  • fix the bug in the previous application by defining a type that represents both the aggregate price and count of automobiles that can be used to correctly calculate the average price
  • compare the output of the modified application with the previous buggy version and verify that the average prices for the vehicles are being calculated correctly
  • identify the shortcomings of regular MapReduce operations which are addressed by Combiners, and how Combiners differ from Reducers

Overview/Description

In this Skillsoft Aspire course, explore the use of Combiners to make MapReduce applications more efficient by minimizing data transfers. Start by learning why Combiners are needed to optimize the execution of a MapReduce application by reducing the data shuffled across the cluster. Recall the steps to process data in a MapReduce application, and look at using a Combiner to perform a partial reduction of the data output from the Mapper. Then use Maven to create a new project for a MapReduce application that calculates average automobile prices. Next, develop the Mapper and Reducer to calculate the average price for each automobile make in the input dataset. Create a Driver program for the MapReduce application, run it, and check the output to get the average price for each make. Learn how to code up a Combiner for the MapReduce application, fix the bug in the application by defining a type that holds both the aggregate price and the count of automobiles, then run the fixed application to verify that the average prices are now calculated correctly. The concluding exercise concerns optimizing MapReduce with Combiners.
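The bug this course fixes can be shown in a few lines of plain Java (no Hadoop involved). A Combiner that pre-averages each input split and a Reducer that then averages those averages weights every split equally, regardless of how many records each held; the fix is for the Combiner to emit a partial (sum, count) pair that the Reducer can merge exactly. The price figures are illustrative.

```java
import java.util.List;

public class CombinerAverage {

    // Buggy: collapse each split to its mean, then average the means.
    static double buggyAverage(List<List<Double>> splits) {
        return splits.stream()
            .mapToDouble(s -> s.stream().mapToDouble(Double::doubleValue).average().orElse(0))
            .average().orElse(0);
    }

    // Correct: each split's Combiner emits a partial (sum, count);
    // the Reducer adds the partials and divides once at the end.
    static double correctAverage(List<List<Double>> splits) {
        double sum = 0;
        long count = 0;
        for (List<Double> s : splits) {
            double partialSum = s.stream().mapToDouble(Double::doubleValue).sum();
            sum += partialSum; // Reducer merges partial sums...
            count += s.size(); // ...and partial counts
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Split 1 holds three prices, split 2 holds only one.
        List<List<Double>> splits = List.of(
            List.of(10000.0, 20000.0, 30000.0), List.of(40000.0));
        System.out.println(buggyAverage(splits));   // prints 30000.0 (wrong)
        System.out.println(correctAverage(splits)); // prints 25000.0 (right)
    }
}
```

This is also why a Combiner must not be assumed to run a fixed number of times: Hadoop may invoke it zero or more times per Mapper, so only operations whose partial results merge associatively, such as (sum, count) pairs, are safe to combine.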



Target

Prerequisites: none
